Correlation Clustering Revisited: The "True" Cost of Error Minimization Problems

Authors

  • Nir Ailon
  • Edo Liberty
Abstract

Correlation Clustering was defined by Bansal, Blum, and Chawla as the problem of clustering a set of elements based on a possibly inconsistent binary similarity function between element pairs. Their setting is agnostic in the sense that a ground truth clustering is not assumed to exist, and the only reasonable way to measure the cost of a solution is by comparing it with the input similarity function. This problem has been studied in theory and application and has been subsequently proven to be APX-Hard. In this work we assume that there does exist an unknown correct clustering of the data. This is the case in applications such as record linkage in databases. In this setting, we argue that it is more reasonable to measure accuracy of the output clustering against the unknown underlying true clustering. This corresponds to the intuition that in real life an action is penalized or rewarded based on reality and not on our noisy perception thereof. The traditional combinatorial optimization version of the problem only offers an indirect solution to our revisited version via a triangle inequality argument applied to the distances between the output clustering, the input similarity function and the underlying ground truth. In the revisited version, we show that it is possible to shortcut the traditional optimization detour and obtain a factor 2 approximation. This factor could not have possibly been obtained by using a solution to the traditional problem as a black box, unless it was an exact optimal solution. Our result therefore shortcuts the APX-Hardness, and could be useful for revisiting many other combinatorial optimization problems. Our analysis consists of two solutions. The first gives a simple 2-approximation algorithm. The second involves a novel way to continuously morph a general (non-metric) distance function into a metric. This technique is interesting in its own right and may be useful for other metric embedding problems. The resulting morphed solution is randomly rounded into a clustering. En route, in certain cases we obtain a certificate for the possibility of getting a solution of factor strictly less than 2. Finally, we show simple cases in which randomness is necessary for achieving a solution of factor strictly less than 2, thus justifying the use of randomization in our solution.
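The two cost notions contrasted in the abstract, and the triangle inequality argument that links them, can be illustrated with a small sketch. This is not the authors' algorithm: the five-element instance, the flipped pairs, the candidate output clustering, and the helper names `clustering_to_pairs` and `disagreements` are all hypothetical, chosen only to show how the distances between the output clustering, the input similarity function, and the ground truth relate.

```python
from itertools import combinations


def clustering_to_pairs(labels):
    """Encode a clustering as a binary function on unordered pairs:
    1 if the two elements share a cluster, 0 otherwise."""
    n = len(labels)
    return {(i, j): int(labels[i] == labels[j])
            for i, j in combinations(range(n), 2)}


def disagreements(a, b):
    """Count the pairs on which two binary pair functions differ."""
    return sum(1 for pair in a if a[pair] != b[pair])


# Hypothetical ground-truth clustering of five elements (assumption).
truth = [0, 0, 0, 1, 1]
truth_pairs = clustering_to_pairs(truth)

# Hypothetical noisy input similarity: the truth with two pairs flipped.
# The result need not be consistent with any clustering.
noisy_input = dict(truth_pairs)
noisy_input[(0, 1)] = 0
noisy_input[(2, 3)] = 1

# Hypothetical output clustering produced by some algorithm.
output = [0, 0, 0, 0, 1]
output_pairs = clustering_to_pairs(output)

cost_vs_input = disagreements(output_pairs, noisy_input)  # traditional cost
cost_vs_truth = disagreements(output_pairs, truth_pairs)  # revisited cost
noise = disagreements(noisy_input, truth_pairs)           # input vs. truth

# Disagreement counts are Hamming distances, so the triangle inequality holds.
assert cost_vs_truth <= cost_vs_input + noise
print(cost_vs_input, cost_vs_truth, noise)
```

The traditional objective only controls `cost_vs_input`; the bound on `cost_vs_truth` then follows indirectly through the triangle inequality, which is the detour the abstract says the revisited analysis avoids.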

Similar resources

Magnetic Calibration of Three-Axis Strapdown Magnetometers for Applications in Mems Attitude-Heading Reference Systems

In a strapdown magnetic compass, the heading angle is estimated using the Earth's magnetic field measured by Three-Axis Magnetometers (TAM). However, due to several inevitable errors in the magnetic system, such as sensitivity errors, non-orthogonality and misalignment errors, hard iron and soft iron errors, measurement noise and local magnetic fields, there is a large error between the magnetometers'...


A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) is a common method in data mining that has been used in different applications as a dimension reduction, classification or clustering method. Methods based on the alternating least squares (ALS) approach are usually used to solve this non-convex minimization problem. At each step of ALS algorithms two convex least squares problems should be solved, which causes high com...


The Effects of Newmark Method Parameters on Errors in Dynamic Extended Finite Element Method Using Response Surface Method

The Newmark method is an effective method for numerical time integration in dynamic problems. The results of the Newmark method are a function of its parameters (β, γ and ∆t). In this paper, a stationary mode I dynamic crack problem is coded in the extended finite element method (XFEM) framework in Matlab and the results are verified against the analytical solution. This paper focuses on effects of main pa...


Missing data imputation in multivariable time series data

Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Missing data imputation for multivariate time series is a challenging topic and needs to be carefully considered before learning or predicting time series. Much research has been done on the use of diffe...


Energy cost minimization in an electric vehicle solar charging station via dynamic programming

The environmental crisis and the shortage of fossil fuels make Electric Vehicles (EVs) an alternative to conventional vehicles. With growing numbers of EVs, coordinated charging is necessary to prevent problems such as large peaks and power losses for the grid and to minimize charging costs for EV owners. Therefore, this paper proposes an optimal charging schedule based on Dynamic Programming (DP...



Publication date: 2009